
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs


Abstract

The lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to efficient implementation on massively parallel hardware, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver optimized for third-generation nVidia GPU hardware, also known as 'Kepler'. We review previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time-propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance. Here we examine three approaches that make use of different hardware options: two rely on 'performance-enhancing' features of the GPU, namely shared memory and the new shuffle instruction found in Kepler-based GPUs, and these are compared with a standard transfer of data that relies instead on an optimized storage layout to increase coalesced access. The simpler approach is shown to be the most efficient: the large number of registers required per thread in LBM limits the block size, which reduces the benefit of these special features. Detailed results are obtained for a D3Q19 LBM solver, benchmarked on nVidia K5000M and K20C GPUs. On the latter, the use of the read-only data cache is explored, and a peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. A periodic bottleneck in solver performance is also reported and is believed to be hardware-related: spikes in iteration time occur at a frequency of around 11 Hz on both GPUs, independent of problem size.
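To make the contrast in the abstract concrete, below is a minimal CUDA sketch (not taken from the paper) of the pull-style streaming step for a single D3Q19 direction, comparing a plain coalesced global read against a warp-shuffle register shift. The kernel names, array names (f_src, f_dst) and the NX/NY/NZ grid are illustrative assumptions, periodic wrapping in x is assumed for brevity, and the modern __shfl_up_sync intrinsic stands in for the original Kepler-era __shfl_up.

```cuda
// Hypothetical sketch, not the authors' code: pull streaming for one
// D3Q19 direction, q = 1 with velocity e = (+1, 0, 0).
#include <cstdio>
#include <cuda_runtime.h>

#define NX 128   // NX is assumed a multiple of the warp size (32)
#define NY 128
#define NZ 128
#define N  (NX * NY * NZ)
#define Q  19

// Structure-of-arrays layout f[q * N + cell]: adjacent threads touch
// adjacent addresses, so this "simple" version is fully coalesced.
// On compute-capability 3.5 Kepler parts (e.g. the K20C), the
// const __restrict__ qualifiers let the compiler route these loads
// through the read-only data cache (the same effect as __ldg).
__global__ void stream_pull_simple(const float* __restrict__ f_src,
                                   float* __restrict__ f_dst)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y, z = blockIdx.z;

    int xm   = (x == 0) ? NX - 1 : x - 1;     // -x neighbour, periodic
    int cell = (z * NY + y) * NX + x;
    int src  = (z * NY + y) * NX + xm;
    f_dst[1 * N + cell] = f_src[1 * N + src];
}

// Shuffle variant: each lane loads its own cell once, then takes the
// -x neighbour's value from the next-lower lane's register instead of
// issuing a second global read. Only lane 0 of each warp, which has
// no lower lane to shuffle from, falls back to global memory.
__global__ void stream_pull_shuffle(const float* __restrict__ f_src,
                                    float* __restrict__ f_dst)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y, z = blockIdx.z;

    int cell   = (z * NY + y) * NX + x;
    float mine = f_src[1 * N + cell];
    float left = __shfl_up_sync(0xffffffffu, mine, 1); // lane i-1 -> i

    if ((threadIdx.x & 31) == 0) {            // warp boundary fallback
        int xm = (x == 0) ? NX - 1 : x - 1;
        left = f_src[1 * N + (z * NY + y) * NX + xm];
    }
    f_dst[1 * N + cell] = left;
}

int main()
{
    float *f_src, *f_dst;
    cudaMalloc(&f_src, Q * N * sizeof(float));
    cudaMalloc(&f_dst, Q * N * sizeof(float));
    cudaMemset(f_src, 0, Q * N * sizeof(float));

    dim3 block(32, 1, 1);          // one warp per x-row segment
    dim3 grid(NX / 32, NY, NZ);
    stream_pull_simple<<<grid, block>>>(f_src, f_dst);
    stream_pull_shuffle<<<grid, block>>>(f_src, f_dst);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(f_src);
    cudaFree(f_dst);
    return 0;
}
```

The shuffle variant saves one redundant global read per cell by passing the neighbour's value between registers, but only within a warp. As the abstract notes, the large per-thread register demand of a full D3Q19 solver limits the block size, which is why the simple coalesced version proved most efficient in practice.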

Record details

  • Authors

    Mawson, M.; Revell, A.

  • Author affiliation
  • Year: 2014
  • Total pages
  • Original format: PDF
  • Language: eng
  • CLC classification
